-
Notifications
You must be signed in to change notification settings - Fork 12
Conversation
Should improve readability
Very unsuccessful
Requires custom_timeout binary to be in the PATH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jankrepl , thank you for this immense work!
I added a few comments, but I think the overall structure looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, amazing work with the refactoring!
I left two minor suggestions, then feel free to merge it for real 🔥
if config_path: | ||
if not pathlib.Path(config_path).exists(): | ||
raise ValueError(f"The configuration path {config_path} does not exist!") | ||
|
||
config = luigi.configuration.get_config() | ||
config.add_config_path(config_path) | ||
config.reload() | ||
|
||
if luigi_config: | ||
config = luigi.configuration.get_config() | ||
for param in luigi_config.split(","): | ||
change = re.split(r"[.:]", param, maxsplit=3) | ||
config.set(*change) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both config-path
and luigi-config
CLI arguments? I am not sure I understand the difference.
Also, isn't it easier/better to remove them and just rely on the env variable LUIGI_CONFIG_PATH
(see luigi
Docs)? This is for instance what BBP's mcar
does, see e.g. how we are invoking it here in another project.
Then we would be able to run the entrypoint like this:
LUIGI_CONFIG_PATH=some/path/luigi.cfg bbs_database run [args...]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config-path
is there to change the path to the configuration file.luigi-config
is to allow the user to overwrite a parameter (if needed). e.g.--luigi-config GlobalParams.source:arxiv
.
If you prefer, I can remove those two arguments. Maybe just a side note related to this: I have the feeling it is much easier to test the complete pipeline with those two arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestions:
- Check what happens if a user provides
config-path
but thenLUIGI_CONFIG_PATH
is also set as an environmental variable. What has precedence? - Rename
config-path
asluigi-config-path
+ make parser helper more explicit (help="Configuration Path."
-->help="Path to Luigi configuration file."
) - Rename
luigi-config
asluigi-config-args
+ make parser helper more explicit (help="Configuration parameters."
-->help="Comma separated key-value arguments for Luigi configuration, e.g. "--luigi-config GlobalParams.source:arxiv,DownloadTask.from-month:2021-04" . Overwrites the content of Luigi configuration file (see --luigi-config-path)."
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 0d343d3.
For info, the luigi-config-path
has precedence over the environment variable.
Fixes #562 and fixes #558
Description
Diagrams
Folder naming logic
Pipeline steps
Important design choices
$output_dir/$source/$date/
output_dir
is specified via the CLI (--output-dir
)source
is also specified via the CLIdate
is a concatenation of the below two dates (using the_
separator for joining) (e.g.2022-02_2022-02-15
):from-month
(format%Y-%m
) (specified via the CLI)%Y-%m-%d
)PerformFilteringTask
creates symlinks to those articles that made it through the filtering and puts these symlinks tofiltered/
. This makes it very easy forParseTask
to ingest them.--mesh-topic-db=mesh_topic_db.json
-> Thebbs_database parse-mesh-rdf
is not included in the pipeline.local_scheduler=True
). It is mostly for debugging purposes. However, we might want to use the central scheduler in "production". See more info https://luigi.readthedocs.io/en/stable/central_scheduler.htmlWARNING
. Feel free to modify it to have more information on whatluigi
is doing.How to test?
First of all, initialize an empty database.
Make sure you define the topic rules in
rules_config.jsonl
. If you want all articles to be accepted (debugging) just doYou can create a helper script
launch_luigi.sh
.bbs_database run \ --source=pubmed \ --from-month=2022-02 \ --filter-config=rules_config.jsonl \ --output-dir=pipeline_results/ \ --mesh-topic-db=mesh_topic_db.json \ --db-url=my_db.sqlite \ --grobid-host=dgx1.bbp.epfl.ch \ --grobid-port=8070 \ $@
And you can then run
luigi
in 2 different waysDry run
PENDING
) and which are done (COMPLETED
).Real run
To speed up the pipelines (debugging purposes) consider creating the following script
custom_timeout
.You can then simply use it inside of
ExternalProgramTask
as the first argument. Very useful forDownloadTask
. Make sure that the Python process can find it in thePATH
environment variable.TODO + To discuss
timeout
but need to handle the nonzero exit code somehow)DownloadTask
so that there are not that many articlesluigi
vs our entrypoints)logging.INFO
. However, can be changed using the hardcoder "global" variableVERBOSITY
pubmed
(there is a PR for this)topic-extract
parse
biorxiv
andmedrxiv
work end-to-endluigi.cfg
)convert-pdf
--db-type=postgres
bbs_database run --dry-run
remove the parameters to make it more readableExternalProgramTask
with simplefrom bluesearch.entrypoint.database.? import run; run()
? Maybe we don't need to replace it at all.luigi
history workluigi
to requirementsrules.json
Please provide here instructions on how to test the changes introduced by this PR.
(if some changes cannot be tested by automated tests)
Checklist
(if it is not the case, please create an issue first).
(if needed)
whatsnew.rst
updated.(if needed)
setup.py
andrequirements.txt
updated with new dependencies.(if needed)
(if a function is added or modified)